
feat: expand eval dataset with edge and complex cases and refine prompts #458

Open
cocosheng-g wants to merge 33 commits into main from feat/eval-issue-219-triage


@cocosheng-g cocosheng-g commented Feb 5, 2026

This PR continues the work on issue #219 by expanding the evaluation datasets and refining the workflow prompts.

📊 Evaluation Results (Post-Tuning)

| Workflow | Previous Pass Rate | Current Pass Rate | Improvement |
| --- | --- | --- | --- |
| Issue Triage | 75% | 100% (20/20) | +25% |
| Issue Fixer | ~73% | 100% (Confirmed Validation) | Improved Guardrails |

Changes:

Expanded Evaluation Datasets: Added 30+ edge, complex, and real-life cases across triage, fixer, and pr-review.

Prompt Refinements:

  • Issue Triage: Improved robustness against spam and ambiguous reports. Now correctly handles "It broke" (bug) vs "Help" (ignore).
  • Issue Fixer: Added a validation step (Step 1.5) to proactively identify impossible or out-of-scope requests (e.g., IE6 support).
  • Mock Infrastructure: Updated the mock MCP server to provide realistic data for new evaluation scenarios (race conditions, architectural violations, security risks).

Verification: All evaluations have been verified to pass.

- Implement Isolated `TestRig` for environment-safe, concurrent evaluations.
- Add gold-standard datasets for Issue Triage, Scheduled Triage, Assistant, and Issue Fixer.
- Implement Mock MCP Server for high-fidelity PR Review benchmarking.
- Add nightly evaluation workflow with multi-model strategy matrix.
- Automate aggregate reporting for GitHub Job Summaries.

Next Steps:
- Expand evaluation datasets with more edge cases.
- Fine-tune workflow prompts based on baseline quality analysis.

Refs: #219
- Added 30+ cases (edge, complex, real-life) across gemini-triage, gemini-issue-fixer, and gemini-review.
- Refined triage prompt to handle spam, ambiguity, and vague reports more robustly.
- Added a validation step to issue-fixer prompt to handle impossible or out-of-scope requests.
- Updated mock MCP server to support new evaluation scenarios including race conditions and architectural violations.
- Improved evaluation scripts for better tool call detection in namespaces.
- Verified all evaluations pass with the updated prompts.
@cocosheng-g cocosheng-g requested review from a team as code owners February 5, 2026 17:58
@cocosheng-g cocosheng-g requested review from MJjainam, R2wenD2, bdmorgan and verbanicm and removed request for a team February 5, 2026 17:58

gemini-cli bot commented Feb 5, 2026

🤖 Hi @cocosheng-g, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

- Update triage guidelines for stricter handling of spam and ambiguity.
- Refine fixer validation step to use explicit keywords for out-of-scope cases.
- Improve evaluation pass rates for edge cases.
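
As a rough illustration of the keyword-based out-of-scope check described in this commit, an eval could compare the fixer's response against a list of explicit phrases. This is a minimal sketch, not the code in this PR: the keyword list, the `FixerCase` shape, and both function names are assumptions made for the example.

```typescript
// Assumed keyword list; the actual keywords used by the prompt and evals are not shown in this PR.
const OUT_OF_SCOPE_KEYWORDS = [
  "out of scope",
  "out-of-scope",
  "not feasible",
  "impossible",
];

// Hypothetical dataset entry: only the fields needed for this check.
interface FixerCase {
  id: string;
  expectOutOfScope: boolean;
}

// True when the response explicitly flags the request as out of scope or impossible.
function flagsOutOfScope(response: string): boolean {
  const text = response.toLowerCase();
  return OUT_OF_SCOPE_KEYWORDS.some((keyword) => text.includes(keyword));
}

// A case passes when the response's verdict matches the expectation in the dataset.
function passesScopeCheck(testCase: FixerCase, response: string): boolean {
  return flagsOutOfScope(response) === testCase.expectOutOfScope;
}

// Usage: an IE6-support request is expected to be rejected as out of scope.
const ie6Case: FixerCase = { id: "ie6-support", expectOutOfScope: true };
const ie6Verdict = passesScopeCheck(
  ie6Case,
  "Supporting IE6 is out of scope for this project.",
);
```

Matching on explicit phrases keeps the check deterministic, at the cost of depending on the prompt to produce one of those phrases.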
@cocosheng-g cocosheng-g requested review from jerop and kschaab February 5, 2026 19:20
Base automatically changed from feat/eval-framework to main February 9, 2026 15:23
Resolved conflicts:
- package.json: Use 'vitest' directly in test script (from main)
- .github/workflows/evals-nightly.yml: Use 'Install Gemini CLI' step and 'always()' condition (from main)
- evals/data/*.json: Keep expanded datasets (from HEAD)
- evals/pr-review.eval.ts: Keep updated test logic (from HEAD)
- evals/mock-mcp-server.ts: Manually merged new mock data and tool handlers
- Run tests sequentially to reduce flakiness and avoid API rate limits.

- Enable mock GitHub MCP server for issue-fixer evaluation to match prompt instructions.

- Proactively create 'chats' directory in test rig to prevent 'ENOENT' errors during chat recording.

- Refine structural checks to handle out-of-scope/impossible requests and account for alternative git/issue tool usage.
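
The 'chats' directory item above can be sketched with Node's `fs.mkdirSync` and its `recursive` option. This is only an illustration under assumptions: the helper name `ensureChatsDir` and the directory layout are hypothetical, since the PR does not show where the test rig places the directory.

```typescript
import { existsSync, mkdirSync, mkdtempSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical helper: the real TestRig's directory layout is not shown in this PR.
function ensureChatsDir(rigRoot: string): string {
  const chatsDir = join(rigRoot, "chats");
  // `recursive: true` creates missing parents and does not throw if the directory already exists.
  mkdirSync(chatsDir, { recursive: true });
  return chatsDir;
}

// Usage: create an isolated root, then make sure `chats` exists before the run starts.
const rigRoot = mkdtempSync(join(tmpdir(), "eval-rig-"));
const chatsDir = ensureChatsDir(rigRoot);
ensureChatsDir(rigRoot); // Calling again is safe.
console.log(existsSync(chatsDir)); // true
```

Creating the directory up front means a later write into it cannot fail with 'ENOENT' because of a missing parent.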

- Update expected plan keywords in evaluation datasets.
- Broaden hasExploration check in issue-fixer.eval.ts to include MCP/extension tools.
- Add search_code and get_file_contents to mock-mcp-server.ts.
- Add a 2s delay before reading telemetry logs across all evals to prevent race conditions in CI.
- Fixes failures observed with gemini-3-pro-preview in CI.
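
A fixed delay before reading telemetry, as described in the commit above, might look like the following sketch. The helper names and the log path parameter are assumptions for illustration; only the 2s default comes from the commit message.

```typescript
import { readFile } from "node:fs/promises";

// Resolve after `ms` milliseconds.
function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Hypothetical helper: wait for `delayMs`, then run the supplied reader.
async function readAfterDelay<T>(
  read: () => Promise<T>,
  delayMs = 2000,
): Promise<T> {
  await sleep(delayMs);
  return read();
}

// Hypothetical wrapper: give the CLI time to flush its telemetry log before reading it.
async function readTelemetryLog(logPath: string, delayMs = 2000): Promise<string> {
  return readAfterDelay(() => readFile(logPath, "utf8"), delayMs);
}
```

A fixed wait is simple but heuristic. Polling until the log stops growing would be an alternative if 2s proves too short or too slow in CI.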
- Increase testTimeout to 15m to handle complex cross-file refactor tasks.
- Add 'search' to tool exploration keywords for broader detection.
Signed-off-by: Coco Sheng <cocosheng@google.com>
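
The broadened exploration check from the commits above (MCP/extension tools, namespaced tool names, and the added 'search' keyword) could be approximated as follows. This is a sketch under assumptions: the `ToolCall` shape, the namespace separators, and every keyword except 'search' are hypothetical, and the actual `hasExploration` in `issue-fixer.eval.ts` may differ.

```typescript
// Hypothetical shape of a recorded tool call; the real telemetry format is not shown here.
interface ToolCall {
  name: string;
}

// Assumed keyword list; only "search" is confirmed by the commit message.
const EXPLORATION_KEYWORDS = ["read", "list", "grep", "glob", "search", "get_file_contents"];

// Strip an assumed namespace prefix such as "server__tool" or "server.tool".
function baseToolName(name: string): string {
  const parts = name.split(/__|\./);
  return (parts[parts.length - 1] ?? name).toLowerCase();
}

// True when at least one call looks like codebase exploration.
function hasExploration(calls: ToolCall[]): boolean {
  return calls.some((call) => {
    const base = baseToolName(call.name);
    return EXPLORATION_KEYWORDS.some((keyword) => base.includes(keyword));
  });
}

// Usage: a namespaced MCP tool call counts as exploration; a write-only run does not.
const explored = hasExploration([{ name: "github__search_code" }]);
const writeOnly = hasExploration([{ name: "write_file" }]);
```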
@cocosheng-g cocosheng-g enabled auto-merge (squash) February 26, 2026 04:16